Scotty is a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkey’s citizens and really values efficiency in traveling through traffic–the app even references Star Trek’s “beam me up” in its order buttons.
Scotty provided us with a real-time transaction dataset. With this dataset, we are going to help them solve some forecasting and classification problems in order to improve their business processes.
Scotty: “Bring me the crystal ball!” It’s almost the end of 2017, and we need to prepare a forecast model to help Scotty get ready for the end-of-year demand. Unfortunately, Scotty is not old enough to have last year’s December data, so we cannot look back at past demand to prepare a forecast for December. Fortunately, you already know that time series analysis is more than enough to help us forecast! But, as an investment in the business’s future, we need to develop an automated forecasting framework so we don’t have to meddle with forecast model selection anymore in the future! Build an automated forecasting model for hourly demand that will be evaluated on the next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017)!
Libraries used in this project.
library(forecast)
library(lubridate)
library(magrittr)
library(tidymodels)
library(tidyverse)
library(MLmetrics)
library(plotly)
library(yardstick)

scotty <- read_csv("data/data-train.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## trip_id = col_character(),
## driver_id = col_character(),
## rider_id = col_character(),
## start_time = col_datetime(format = ""),
## src_lat = col_double(),
## src_lon = col_double(),
## src_area = col_character(),
## src_sub_area = col_character(),
## dest_lat = col_double(),
## dest_lon = col_double(),
## dest_area = col_character(),
## dest_sub_area = col_character(),
## distance = col_double(),
## status = col_character(),
## confirmed_time_sec = col_double()
## )
head(scotty)
glimpse(scotty)
## Rows: 90,113
## Columns: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066affcfa261708...
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", NA, "59d006c131e39c6...
## $ driver_id <chr> "59a892c5568be44b2734f276", NA, "599dc0dfa5b4fd5...
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59cd704bcf482f6ce2f...
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-...
## $ src_lat <dbl> 41.07047, 41.07487, 41.04995, 41.05287, 41.06760...
## $ src_lon <dbl> 29.01945, 28.99528, 29.03107, 28.99522, 28.98827...
## $ src_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ src_sub_area <chr> "sxk9s", "sxk9e", "sxk9s", "sxk9e", "sxk9e", "sx...
## $ dest_lat <dbl> 41.11716, 41.08351, 41.04495, 41.08140, 41.02125...
## $ dest_lon <dbl> 29.03650, 29.00228, 28.98192, 28.98197, 29.11316...
## $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", ...
## $ dest_sub_area <chr> "sxk9u", "sxk9e", "sxk9e", "sxk9e", "sxk9q", "sx...
## $ distance <dbl> 5.379250, 1.126098, 4.169492, 3.358296, 11.69357...
## $ status <chr> "confirmed", "nodrivers", "confirmed", "confirme...
## $ confirmed_time_sec <dbl> 8, 0, 32, 65, 0, 27, 32, 30, 0, 24, 9, 0, 118, 1...
scotty %>%
is.na() %>%
  colSums()
## id trip_id driver_id rider_id
## 0 4676 4676 0
## start_time src_lat src_lon src_area
## 0 0 0 0
## src_sub_area dest_lat dest_lon dest_area
## 0 0 0 0
## dest_sub_area distance status confirmed_time_sec
## 0 0 0 0
We need to round the time component down to the hour for easier analysis, as it makes no sense to model minute- or second-level patterns in this case.
data_clean <- scotty %>%
  mutate(datetime = floor_date(x = start_time,
                               unit = "hour"))

Next, we group the data by sub-area and hour, then summarise the hourly demand.
data_clean <- data_clean %>%
group_by(src_sub_area, datetime) %>%
  summarise(demand = n())
## `summarise()` regrouping output by 'src_sub_area' (override with `.groups` argument)
We also need to pad the time series: in time series modelling, the data must be ordered in time, no timestamps may be skipped, and there must be no missing values.
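To illustrate what padding does, here is a minimal sketch on a toy series (hypothetical values) where one hour is missing; pad() inserts the missing timestamp with NA, which we then replace with 0:

```r
library(padr)
library(dplyr)
library(tidyr)

# Toy hourly series with a missing hour (02:00 is absent)
toy <- tibble::tibble(
  datetime = as.POSIXct(c("2017-10-01 00:00:00", "2017-10-01 01:00:00",
                          "2017-10-01 03:00:00"), tz = "UTC"),
  demand = c(3, 1, 2)
)

toy_padded <- toy %>%
  pad(interval = "hour") %>%              # inserts the missing 02:00 row with NA
  mutate(demand = replace_na(demand, 0))  # treat missing hours as zero demand

toy_padded  # 4 rows, demand = 3, 1, 0, 2
```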
range(data_clean$datetime)
## [1] "2017-10-01 00:00:00 UTC" "2017-12-02 23:00:00 UTC"
library(padr)
data_clean <- data_clean %>%
group_by(src_sub_area) %>%
pad(start_val = ymd_hms("2017-10-01 00:00:00"),
end_val = ymd_hms("2017-12-02 23:00:00")) %>%
  ungroup()
## pad applied on the interval: hour
data_clean <- data_clean %>%
mutate(demand = replace_na(data = demand,
replace = 0))
data_clean

Hourly demand per sub-area is visualised in the plot below.
data_clean %>%
ggplot(aes(x = datetime, y=demand)) +
geom_line(aes(col = src_sub_area)) +
facet_wrap(facets = "src_sub_area",ncol = 1) +
  theme(legend.position = "none")

data_clean %>%
filter(src_sub_area == "sxk97") %>%
.$demand %>%
ts(frequency = 24) %>%
decompose() %>%
autoplot() +
labs(title = "Daily Decomposition") +
  theme(plot.title = element_text(hjust = 0.5))

data_clean %>%
filter(src_sub_area == "sxk97") %>%
.$demand %>%
ts(frequency = 24*7) %>%
decompose() %>%
autoplot() +
labs(title = "Weekly Decomposition") +
  theme(plot.title = element_text(hjust = 0.5))

data_clean %>%
filter(src_sub_area == "sxk97") %>%
.$demand %>%
msts(seasonal.periods = c(24, 24*7)) %>%
mstl() %>%
autoplot() +
labs(title = "Multi Seasonal Decomposition") +
  theme(plot.title = element_text(hjust = 0.5))

Cross-validation: we divide the data into a training set (to fit the models) and a test set (to evaluate them). The test set consists of the last week of observations; the rest goes into training. We use the rolling-origin method.
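Below we use a single origin (one train/test split). For reference, a fuller rolling-origin resampling could be sketched with rsample on toy hourly data (hypothetical values):

```r
library(rsample)

# Two weeks of hypothetical hourly demand
set.seed(1)
df <- data.frame(
  datetime = seq(as.POSIXct("2017-10-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 24 * 14),
  demand = rpois(24 * 14, lambda = 5)
)

ro <- rolling_origin(
  df,
  initial    = 24 * 7,  # first training window: one week
  assess     = 24,      # each assessment window: the next day
  cumulative = TRUE     # training window grows as the origin rolls forward
)
ro  # one row per train/assessment split
```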
test_size <- 24 * 7
test_end <- max(data_clean$datetime)
test_start <- test_end - hours(test_size) + hours(1)
train_end <- test_start - hours(1)
train_start <- min(data_clean$datetime)
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)

data_clean %<>%
mutate(sample = case_when(
datetime %within% intrain ~ "train",
datetime %within% intest ~ "test"
)) %>%
drop_na()
head(data_clean)

We scale the data: spread it into wide format, then define a recipe and bake it.
data_clean %<>%
spread(src_sub_area, demand)
rec <- recipe(~ ., data = data_clean) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
data_clean <- bake(rec, data_clean)

Convert the data back into long format.
data_clean %<>%
gather(src_sub_area, demand, -datetime,-sample)
head(data_clean)

Afterwards, we need a helper that reverts values back to the actual scale.
rec_revert <- function(vector, rec, varname) {
rec_center <- rec$steps[[2]]$means[varname]
rec_scale <- rec$steps[[3]]$sds[varname]
results <- (vector * rec_scale + rec_center) ^ 2
results <- round(results)
results
}

We nest the training and test data to facilitate the model selection process; before nesting, we group the data by sub-area.
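As a sanity check, the revert logic (undo the scaling and centering, then square) can be verified by hand on a toy vector (values are illustrative only):

```r
# Toy demand counts
x <- c(0, 4, 9, 16, 25)

# Forward transform, mirroring the recipe: sqrt -> center -> scale
x_sqrt   <- sqrt(x)
ctr      <- mean(x_sqrt)
scl      <- sd(x_sqrt)
x_scaled <- (x_sqrt - ctr) / scl

# Revert, mirroring rec_revert(): rescale, recenter, then square
x_back <- round((x_scaled * scl + ctr)^2)

all(x_back == x)  # TRUE
```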
data_clean %<>%
group_by(src_sub_area, sample) %>%
nest(.key = "data") %>%
spread(sample, data)## Warning: `.key` is deprecated
data_clean

Create time series objects: a plain ts and a multiple-seasonality (msts) series. I’m going to try both types of time series objects in the modelling and see which gives the smallest error when the model predicts.
data_funs <- list(
ts = function(x) ts(x$demand, frequency = 24),
msts = function(x) msts(x$demand, seasonal.periods = c(24, 24 * 7))
)

data_funs %<>%
rep(length(unique(data_clean$src_sub_area))) %>%
enframe("data_fun_name", "data_fun") %>%
mutate(src_sub_area =
sort(rep(unique(data_clean$src_sub_area), length(unique(.$data_fun_name))))
  )

data_clean %<>% left_join(data_funs)
## Joining, by = "src_sub_area"
head(data_clean)

Next, create a list of model algorithms to apply to the dataset:

1. Exponential smoothing state space model (ets)
2. Seasonal and trend decomposition using Loess (stlm)
3. Trigonometric seasonality, Box-Cox transformation, ARMA errors, trend and seasonal components (tbats)
4. Autoregressive integrated moving average (auto.arima)
5. Holt-Winters (holt.winter)

I’ll try all of these models and see which gives the smallest error when predicting. The ets and auto.arima models are not applied to the multiple-seasonality objects because they are not compatible with them.
models <- list(
ets = function(x) ets(x),
stlm = function(x) stlm(x),
tbats = function(x) tbats(x),
auto.arima = function(x) auto.arima(x),
holt.winter = function(x) HoltWinters(x,seasonal = "additive")
)
models %<>%
rep(length(unique(data_clean$src_sub_area))) %>%
enframe("model_name", "model") %>%
mutate(src_sub_area =
sort(rep(unique(data_clean$src_sub_area), length(unique(.$model_name))))
)
models

data_clean %<>%
left_join(models) %>%
filter(!(model_name == "ets" & data_fun_name == "msts"),
         !(model_name == "auto.arima" & data_fun_name == "msts"))
## Joining, by = "src_sub_area"
head(data_clean)

Apply the time series object constructors and the modelling algorithms to the dataset.
data_clean %<>%
mutate(
params = map(train, ~ list(x = .x)),
data = invoke_map(data_fun, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
  select(-data, -params)
## Warning: Problem with `mutate()` input `fitted`.
## i optimization difficulties: ERROR: ABNORMAL_TERMINATION_IN_LNSRCH
## i Input `fitted` is `invoke_map(model, params)`.
## i The error occurred in group 1: src_sub_area = "sxk97".
## Warning in HoltWinters(x, seasonal = "additive"): optimization difficulties:
## ERROR: ABNORMAL_TERMINATION_IN_LNSRCH
data_clean

Forecast on the test window and measure each model’s error (MAE, computed on the reverted scale).
scotty_improve <- data_clean %>%
mutate(error =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2_dbl(test, ~ mae_vec(truth = rec_revert(.y$demand,rec,src_sub_area), estimate = rec_revert(.x$mean,rec,src_sub_area)))) %>%
arrange(src_sub_area, error)
scotty_improve %>%
  select(src_sub_area, ends_with("_name"), error)

Besides measuring the error, we can also compare the forecast results to the actual test series through graphical analysis.
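The error metric used above is the mean absolute error; with yardstick’s mae_vec() on toy vectors:

```r
library(yardstick)

truth    <- c(10, 12, 8, 15)
estimate <- c(11, 10, 9, 14)

mae_vec(truth = truth, estimate = estimate)  # mean(abs(truth - estimate)) = 1.25
```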
Creating the forecasts:
data_test <- scotty_improve %>%
mutate(
forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(test, ~ tibble(
datetime = .y$datetime,
demand = as.vector(.x$mean)
)),
key = paste(data_fun_name, model_name, sep = "-")
)
data_test %<>%
select(src_sub_area, key, actual = test, forecast) %>%
spread(key, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
  mutate(demand = rec_revert(demand, rec, src_sub_area))

Visualization:
viz1 <- data_test %>%
ggplot(aes(x = datetime, y = demand)) +
geom_line(data = data_test %>% filter(key == "actual"),aes(y = demand),alpha = 0.2,size = 0.8) +
geom_line(data = data_test %>% filter(key != "actual"),aes(frame = key,col = key)) +
labs(x = "", y = "Order",title = "COMPARISON OF MODEL PREDICTION RESULTS", frame = "Models") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
scale_color_manual(values = c("forestgreen",
"forestgreen",
"forestgreen",
"forestgreen",
"forestgreen",
"forestgreen",
"forestgreen",
"forestgreen",
"forestgreen")) +
theme_test(base_family = "Bebas Neue") +
  theme(legend.position = "none")
## Warning: Ignoring unknown aesthetics: frame
ggplotly(viz1)
Choose the best model automatically: for each sub-area, the model with the smallest error.
scotty_best <- scotty_improve %>%
select(-fitted) %>%
group_by(src_sub_area) %>%
filter(error == min(error)) %>%
  ungroup()

scotty_best %<>%
mutate(fulldata = map2(train, test, ~ bind_rows(.x, .y))) %>%
select(src_sub_area, fulldata, everything(), -train, -test)
scotty_best

The tbats models have the smallest predictive error for each sub-area. Next, combine the training and test data, refit the final tbats models on the merged datasets, and predict demand for the next 7 days.
scotty_best %<>%
mutate(
params = map(fulldata, ~ list(x = .x)),
data = invoke_map(data_fun, params),
params = map(data, ~ list(x = .x)),
fitted = invoke_map(model, params)
) %>%
select(-data, -params)
scotty_best %<>%
mutate(forecast =
map(fitted, ~ forecast(.x, h = 24 * 7)) %>%
map2(fulldata, ~ tibble(
datetime = timetk::tk_make_future_timeseries(.y$datetime, 24 * 7),
demand = as.vector(.x$mean)
))
)
scotty_best

Unnest the data and visualise the predictions.
scotty_best %<>%
select(src_sub_area, actual = fulldata, forecast) %>%
gather(key, value, -src_sub_area) %>%
unnest(value) %>%
mutate(demand = rec_revert(demand,rec,src_sub_area))
viz2 <- scotty_best %>%
ggplot(aes(x = datetime, y = demand, colour = key)) +
geom_line() +
labs(y = "Order", x = NULL, title = "Prediction Result") +
facet_wrap(~ src_sub_area, scale = "free_y", ncol = 1) +
scale_color_brewer(palette = "Pastel2") +
theme_test(base_family = "Bebas Neue") +
theme(legend.position = "none")
ggplotly(viz2)

predict <- scotty_best %>%
filter(key == "forecast") %>%
select(-key)
head(predict)

write_csv(predict, "data/submission-jans.csv")

Model Performance Result
knitr::include_graphics("data/SS.png")

A good time series model should show no residual autocorrelation and normally distributed residuals.
sxk97
Box.test(residuals(scotty_improve$fitted[[1]]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(scotty_improve$fitted[[1]])
## X-squared = 0.0041596, df = 1, p-value = 0.9486
sxk9e
Box.test(residuals(scotty_improve$fitted[[2]]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(scotty_improve$fitted[[2]])
## X-squared = 5.4407, df = 1, p-value = 0.01967
sxk9s
Box.test(residuals(scotty_improve$fitted[[3]]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(scotty_improve$fitted[[3]])
## X-squared = 15.015, df = 1, p-value = 0.0001067
sxk97: no autocorrelation (p = 0.9486). sxk9e and sxk9s: residual autocorrelation (p < 0.05).
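To illustrate how the Ljung-Box test separates these two cases, compare simulated white noise against a strongly autocorrelated AR(1) series (simulated data, for illustration only):

```r
set.seed(42)

# White noise: no autocorrelation, so the p-value is usually large
wn <- rnorm(500)
Box.test(wn, type = "Ljung-Box")$p.value

# AR(1) with phi = 0.7: strong autocorrelation, so the p-value is tiny
ar1 <- arima.sim(model = list(ar = 0.7), n = 500)
Box.test(ar1, type = "Ljung-Box")$p.value
```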
par(mfrow = c(1,3))
sxk97 <- hist(residuals(scotty_improve$fitted[[1]]), breaks = 30,
col = "chocolate", border = "red", main = "sxk97")
sxk9e <- hist(residuals(scotty_improve$fitted[[2]]), breaks = 30,
col = "chocolate", border = "red", main = "sxk9e")
sxk9s <- hist(residuals(scotty_improve$fitted[[3]]), breaks = 30,
              col = "chocolate", border = "red", main = "sxk9s")

From the histograms, the residuals appear approximately normally distributed.
The no-autocorrelation assumption is not met for sxk9e and sxk9s, meaning there is still information in the lags that the models have not captured and that could improve the forecasts.